ABSTRACT

This  report  summarizes  research  conducted  at  the  Center 
for  Reliable  Computing  with  the  support  of  the  Air  Force 
Office  of  Scientific  Research  from  1  April  1977  to  30  April 
1979.  Results  and  current  work  in  various  aspects  of 
user  system  reliability  evaluation  are  described. 
INTRODUCTION


This  final  scientific  report  describes  various 
activities  at  the  Center  for  Reliable  Computing  (CRC)  of 
Stanford  University's  Computer  Systems  Laboratory.  Section 
2  describes  various  research  topics  which  appeared  in 
publications  by  CRC  personnel.  Section  3  contains  a  list  of 
technical  conferences  and  meetings  attended  by  CRC 
personnel. Section 4 contains a list of CRC personnel
involved in reliability studies. The list of references in
Section 5 contains both references and CRC publications.
Throughout this report, references to CRC research
publications are identified by a *. Publications supported
by this grant are identified by a +.

The principal research results have produced
mathematically tractable means to analyze gracefully
degrading computing systems from a reliability, performance,
and diagnosability point of view. Several actual computing
systems  were  used  as  examples  of  techniques  described  in 
this  report.  These  systems  included: 

1.  DAIS,  a  multiprocessor  for  avionics  applications, 

2.  PRIME,  a  multiprocessor  developed  at  the 
University  of  California  at  Berkeley, 

3.  the  Stanford  Linear  Accelerator  Center's  Triplex 
multiprocessor,  and 

4.  SIRU,  a  dual  redundant  computer  for  navigation  and 
control  in  avionics. 
The testing of digital circuits for both permanent and
intermittent failures was also extensively studied, resulting
in reasonable and practical methods for random and
deterministic testing strategies. Design techniques for
producing  self-checking  circuits  which  avoid  duplication 
were  shown  to  be  useful  in  combinational  circuit  design. 
Redundant  digital  systems,  resilient  to  synchronization  and 
data  matching  problems,  were  described  as  well  as  design 
techniques  for  achieving  tolerance  to  these  failures. 
2. RESEARCH DESCRIPTION


This section describes research efforts focussed on
various aspects of the reliability evaluation of computer
systems. McCluskey* [1977+] discussed the importance of
reliability and maintainability in the design of computer
architectures. Reliability evaluation of fault-resistant
systems concentrated on a study of gracefully degrading
(also called fail-softly) computing systems, simulation
techniques for reliability evaluation, and communication
strategies for fault-tolerant computer networks. System
level  reliability  analysis  included  studies  of  various 
multiprocessor  computer  systems.  Computer  system  component 
reliability  evaluation  concentrated  on  the  problems  involved 
in  testing  and  designing  digital  circuits. 

2.1  RELIABILITY  EVALUATION  OF  FAULT-RESISTANT  COMPUTER 
SYSTEMS 

The  major  modeling  effort  concentrated  on  fault- 
resistant  computer  systems.  Fail-softly  or  gracefully 
degrading  computing  systems  react  to  a  detected  failure  by 
reconfiguring  to  a  state  which  might  have  a  decreased  level 
of  performance.  For  example,  if  a  single  processor  of  a 
multiprocessor  system  fails,  the  system  may  continue  to 
operate  without  the  faulty  processor,  but  will  have 
decreased performance until the processor can be repaired
and  then  reconfigured  into  the  system.  Three  areas 
particularly  relevant  to  the  analysis  of  such  systems  were 
studied  at  the  Center  for  Reliable  Computing: 

1.  analytic  modeling, 

2.  appropriate  reliability  measures,  and 

3.  diagnosis  models. 

Other  research  explored  the  use  of  a  general  purpose 
simulator  to  study  simulation  as  a  means  of  reliability 
evaluation in digital systems. The problems of
fault-tolerant communication in computer networks were also
examined.

The  use  of  Markov  models  for  the  reliability  evaluation 
of  gracefully  degrading  systems  was  described  by  Losq* 
[1977].  His  model  analyzed  the  effects  of  failures  on  such 
systems  and  took  into  account  the  internal  structure  of  the 
hardware,  the  characteristics  of  various  detection 
mechanisms,  and  the  unreliability  of  software.  The  system 
was partitioned into resources, e.g., processing, memory,
and operating system software, and each resource was modeled
independently. System optimization gave the best number of
modules in a resource and demonstrated the trade-offs
between hardware and software detection mechanisms. The
model provided values for the system availability, mean time
before failure (MTBF), and the proportion of time that the
system spent in degraded modes.


Many important sources of system failure are not
adequately modeled by techniques which require the
assumption of constant failure rates. Beaudry* [1979R+]
discussed the probabilistic structure of several sources of
computing system failure. Non-homogeneous Poisson processes
provided the mathematical framework in which to model these
failure mechanisms. The detection latency of a physical
failure is the time which elapses between the occurrence of
the physical failure and the detection of that failure at
the system level. The rate at which detections of physical
failures occur is dependent on the failure rate, the system
testing strategy, and, very importantly, on the usage
characteristics of the failed circuit or system. If the
system usage varies during operation of a computing system,
then the detection rate will also vary. Non-homogeneous
Poisson processes provided a model for such time-dependent
phenomena in computing systems with varying usage
characteristics.
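
As an illustration of a usage-dependent detection rate, the following
Python sketch samples detection times from a non-homogeneous Poisson
process by thinning. The rate function usage_rate, its numeric values,
and the 24-hour usage cycle are assumptions made for this example only;
they are not taken from the cited work.

```python
import random


def sample_nhpp(rate, rate_max, horizon, rng):
    """Sample event times on [0, horizon) from a non-homogeneous Poisson
    process with intensity rate(t) <= rate_max, using thinning."""
    times = []
    t = 0.0
    while True:
        # Candidate events come from a homogeneous process of rate rate_max.
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return times
        # Keep each candidate with probability rate(t) / rate_max.
        if rng.random() < rate(t) / rate_max:
            times.append(t)


def usage_rate(t):
    """Hypothetical detection rate (per hour): high during a 12-hour busy
    period of each day, low during the remaining 12 hours."""
    return 0.5 if (t % 24.0) < 12.0 else 0.05


rng = random.Random(1979)
events = sample_nhpp(usage_rate, rate_max=0.5, horizon=24.0 * 200, rng=rng)
busy = sum(1 for t in events if (t % 24.0) < 12.0)
idle = len(events) - busy
print(busy, idle)  # far more detections fall in the busy periods
```

With these assumed rates, roughly ten times as many detections are
expected in busy periods as in idle ones, which a constant-rate model
cannot represent.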

Beaudry* [1975, 1977, 1978A+, 1978D+] studied measures
which were particularly relevant to the reliability
evaluation of gracefully degrading systems. Differences in
performance levels led to consideration of the computation
capacity of a system as the amount of useful computation
(considering such performance factors as execution speed,
throughput, response time, overhead and user demand) per
unit time available on the system. The mean computation
before failure, MCBF, and the computation availability were
defined for such systems. These measures evaluated the
system in terms of its capacity to execute computing tasks.
Markov models permitted the development of measures to
analyze the effects of distributing the computation load in
a multiprocessor or gracefully degrading system. These
measures were especially useful when evaluating systems
where both reliability and performance were important.
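
A minimal sketch of how such measures can be computed is given below
in Python. It assumes a simple Markov model that is not taken from the
cited work: processors fail independently at a constant rate, there is
no repair, reconfiguration always succeeds, and the system fails when
the last processor fails. The function name and the capacity function
are hypothetical.

```python
def time_and_computation_before_failure(n_processors, failure_rate, capacity):
    """Return (MTBF, MCBF) for a degradable system with no repair.

    capacity(k) is the assumed computation capacity (useful computation
    per unit time) when k processors are still working.
    """
    mtbf = 0.0
    mcbf = 0.0
    for k in range(n_processors, 0, -1):
        # With k processors up, the time to the next failure is
        # exponential with rate k * failure_rate.
        dwell = 1.0 / (k * failure_rate)
        mtbf += dwell                # expected time spent with k units
        mcbf += capacity(k) * dwell  # computation delivered in that time
    return mtbf, mcbf


# Two processors, each failing at 0.5 per unit time, capacity equal to
# the number of working processors.
mtbf, mcbf = time_and_computation_before_failure(2, 0.5, lambda k: float(k))
print(mtbf, mcbf)  # 3.0 4.0
```

In this example the time-based measure credits the degraded state fully,
while the computation-based measure weights each state by the work it
can deliver.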

The  problem  of  self-diagnosis  of  multi-unit  digital  systems 
was considered by Blount* [1977A, 1977B, 1978D+, 1978E+]. In
such systems, each unit can test other units and/or be
tested by other units. The system test consists of running
and compiling test results from all unit tests in the
system. As a result of the system test, a decision is made
regarding the fault status (faulty or fault-free) of each
unit. In Blount* [1978C+, 1978E+], procedures were developed
for calculating probabilistic diagnosability measures for
systems that could be modeled in earlier graph-theoretic
diagnosis  models.  This  model  was  extended  by  allowing  the 
modeling  of  self-testing  units  and  including  the 
uncertainties  and  imperfections  of  the  testing  process. 
Procedures  were  given  for  calculating: 
1. the probability that a specified fault condition
is correctly diagnosed, and

2. the probability that correct diagnosis is performed
when a system test is performed.

A  procedure  was  also  given  for  deriving  a  diagnosis 
strategy that results in optimal system-wide diagnosability.
This model formed the basis for a study of diagnosis in
fail-softly or gracefully degrading computer systems. If a
system  is  to  survive  a  failure,  four  steps  must  be 
successfully  taken: 

1.  the  fault  must  be  detected, 

2. the fault must be accurately diagnosed,

3.  the  faulty  unit  must  be  removed  from  the  system, 
and 

4. the data base and system software must be returned
to  some  acceptable  state. 
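
The four steps above can be combined into a single survival
probability, as in the Python sketch below. It assumes that each
probability is conditional on the success of the preceding steps; the
function and parameter names are hypothetical and the model is a
simplification, not the diagnosis model of the cited work.

```python
def survival_probability(p_detect, p_diagnose, p_remove, p_restore):
    """Probability that a failure is survived, i.e. that detection,
    diagnosis, removal and state restoration all succeed in turn.

    Each argument is the probability of that step succeeding given that
    all earlier steps succeeded.
    """
    steps = (p_detect, p_diagnose, p_remove, p_restore)
    for p in steps:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    result = 1.0
    for p in steps:
        result *= p
    return result


print(survival_probability(0.5, 0.5, 1.0, 1.0))  # 0.25
```

The product form shows that a weak step cannot be compensated by the
others: the survival probability never exceeds the smallest factor.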

Blount* [1978A+, 1978B+, 1978E+] allowed the decision step of
the  diagnosis  procedure  to  be  implemented  by  a  set  of  faulty 
and/or  fault-free  units,  thereby  modeling  a  possible  source 
of  error  in  the  diagnosis  process. 

Thompson* [1977B+, 1977D+] described a simulation
package designed to evaluate the reliability of digital
systems. The simulator was designed to model many different
types of systems, at varying levels of detail. The user was
given the capability to use the elements of the model in the
way best suited to simulating the operation of a system in
the presence of faults. The simulation package then
generated random faults in a model using a Monte Carlo
analysis to obtain reliability curves. When compared with
other types of simulators, this simulator was found to
provide a greater degree of flexibility in specifying the
model for the system.
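
The Monte Carlo approach can be sketched in a few lines of Python.
The example below is not the CRC simulation package; it assumes a
k-out-of-n system of identical units with exponentially distributed
lifetimes and no repair, and all names are hypothetical.

```python
import random


def system_failure_time(n_units, k_required, failure_rate, rng):
    """Draw one random fault history and return the time at which fewer
    than k_required of the n_units remain working."""
    lifetimes = sorted(rng.expovariate(failure_rate) for _ in range(n_units))
    # The system fails at the (n - k + 1)-th unit failure.
    return lifetimes[n_units - k_required]


def reliability_curve(times, trials, n_units, k_required, failure_rate, seed=0):
    """Estimate R(t) at each requested time as the fraction of simulated
    systems that are still working at t."""
    rng = random.Random(seed)
    failures = [
        system_failure_time(n_units, k_required, failure_rate, rng)
        for _ in range(trials)
    ]
    return [sum(1 for f in failures if f > t) / trials for t in times]


# Triple modular redundancy: 2 of 3 units must work.
curve = reliability_curve([0.5, 1.0, 2.0], 20000, 3, 2, 1.0)
print(curve)
```

For this simple case the estimate can be checked against the closed
form R(t) = 3exp(-2t) - 2exp(-3t); a simulator is of most use for
models where no such expression is available.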

Betancourt* [1978+] presented an excellent survey of
available  reliability  evaluation  computer  programs.  These 
programs  can  be  classified  into  three  main  categories: 

1.  Fault  tree  analysis, 

2.  Reliability  equations,  and 

3.  Markov  chain  models. 

References are provided for each program as well as a
description  of  the  computer  language,  input,  output  and 
underlying  model. 

The  quality  of  service  provided  by  a  computer  network 
is  extremely  sensitive  to  the  performance  of  the 
inter-computer  communication  network.  The  loss  of 
communication  links  or  centers  causes  transmission  delay, 
congestion,  and  can  disconnect  networks.  Losq* 
[1978A+, 1978B+] studied structures for communication
networks that minimise the effects of failures on
performance. He proposed a measure for the fault-tolerance
of store-and-forward packet switching networks which defined
operational limits within a worst case response time.
Properties  of  the  networks  led  to  simplification  of  routing 
algorithms  and  reconfiguration  following  failures. 

2.2  SYSTEM  LEVEL  RELIABILITY  ANALYSIS 

Several  actual  computer  systems  have  been  analyzed 
using  both  analytical  modeling  and  simulation.  Losq* 
[1977]  applied  his  model  to  the  PRIME  system  to  derive  the 
MTBF  and  availability  of  that  system.  The  analysis  examined 
both  hardware  and  software  failures  and  studied  the  effect 
of  testing  and  hardware  detection  mechanisms  on  the  overall 
system  reliability. 

Beaudry* [1978A+, 1978B+, 1978C+] examined the effects of
variations in system workloads on the overall system
reliability and performance. A relationship was
demonstrated between the system failure rate and the user
demand for system resources. In particular, a strong
relationship was shown between the arrival rate of batch
jobs in a large multiprocessor and the failure rate of that
multiprocessor computing system. Beaudry* [1978A+, 1978E+]
presented an extensive statistical analysis of failures in
the SLAC (Stanford Linear Accelerator Center) Triplex
multiprocessor. The analysis demonstrated the effect of
system workload on the overall reliability.

Performance-related  as  well  as  traditional  reliability 
measures  were  calculated;  hardware,  software  and 
operator-induced system failures were taken into account.

The behavior of time-dependent failure rates was
studied by Beaudry* [1978B+] in two computer systems, the
SLAC Triplex multiprocessor and DAIS, a multiprocessor
system for avionics applications. Recurrent failures occur
when the recovery process following a failure in a computing
system is incomplete or incorrect. Such failures were shown
to be a significant source of system down time in the SLAC
Triplex  multiprocessor.  Recurrent  failures  were  also  shown 
to  be  a  critical  consideration  in  the  DAIS  architecture. 

Blount* [1978A+, 1978B+, 1978E+] studied the PRIME system
and determined the sensitivity of the system diagnosability
to various levels of test quality. He took into account the
imperfections  in  both  the  system  test  and  the  decision  steps 
of  the  diagnosis  process.  This  study  demonstrated  the  use 
of  fail-softly  systems  diagnosis  models  as  a  research  tool 
for  the  general  study  of  diagnosis  techniques. 

The SIRU dual redundant system was extensively analyzed
by Thompson* [1977A+, 1977C+] using simulation to evaluate
the proposed design of this dual computer system. The
system  was  modeled  in  sufficient  detail  to  study  the  effects 
of  certain  design  changes  on  the  overall  reliability.  This 
study  demonstrated  that  this  simulation  technique  could  be 
applied  to  complex  systems  and  derived  the  conditions 
necessary  to  obtain  a  desired  confidence  in  the  results  of 
the  analysis. 
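
One common way to relate the number of simulation runs to the
confidence in an estimated probability is the normal approximation to
the binomial distribution, sketched below in Python. This is a textbook
rule offered as an illustration; it is not claimed to be the set of
conditions derived in the cited study, and the function name is
hypothetical.

```python
import math


def trials_needed(p_estimate, half_width, z=1.96):
    """Number of independent Monte Carlo trials for which the
    normal-approximation confidence interval of a probability near
    p_estimate has at most the given half-width.

    z = 1.96 corresponds to a confidence level of about 95 percent.
    """
    if not 0.0 < p_estimate < 1.0:
        raise ValueError("p_estimate must lie strictly between 0 and 1")
    if half_width <= 0.0:
        raise ValueError("half_width must be positive")
    variance = p_estimate * (1.0 - p_estimate)
    return math.ceil(z * z * variance / (half_width ** 2))


# Estimating a reliability near 0.5 to within +/- 0.01 takes on the
# order of ten thousand trials.
print(trials_needed(0.5, 0.01))
```

Because the required number of trials grows with the inverse square of
the half-width, halving the tolerated error roughly quadruples the
simulation effort.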

2.3  COMPONENT  LEVEL  RELIABILITY  ANALYSIS 

Digital  circuits  must  be  designed  to  provide  for  both 
testability and maintainability. In [McCluskey*, 1978A+,
1978B+], current techniques and issues in design for
maintainability  and  testability  are  surveyed.  CRC  studied 
various  testing  strategies  for  intermittent  and  permanent 
failure  detection,  as  well  as  issues  involved  in  the 
reliable  synchronization  of  redundant  digital  systems. 

In digital circuits, intermittent failures are
extremely difficult to detect. Their effects can disappear
while a test is run and yet produce incorrect outputs at
other times. Losq*
[1978D+, 1978E+] studied the efficiency of test procedures to
detect intermittent failures in combinational circuits. One
of the factors contributing to the complexity of random
testing is the fact that testing can never guarantee that a
circuit is free from intermittent failures. As a
consequence, testing becomes a probabilistic problem and
requires much information about the statistical properties
of the failures and their effects on the circuit. If all
this information is available, then it is possible to find
the best test sequence and also to evaluate the confidence
that the circuit is fault-free at the end of a test sequence
when no error indication has been detected. The best test
strategy is deterministic: it involves a given test
sequence, or more exactly, a series of test sequences,
depending upon the desired length of the test. This study
of detection of intermittent failures in combinational
circuits leads one to believe that the same results may
apply to sequential circuits.

Random compact testing uses random inputs to test
digital circuits. Losq* [1978C+] studied the achievement of
fault detection by comparing some statistic of the circuit
under test, e.g., the frequency of logic ones at an output,
with the value of that statistic previously determined for
the fault-free circuit. He concluded that random compact
testing can efficiently detect failures in both
combinational and sequential circuits. Even though random
compact testing cannot guarantee a specific confidence in
its results, it is still an efficient way to detect most of
the failures that can occur in circuits. With testing
experiments of sufficient length, a very high detection
coverage can be obtained.
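
The idea of comparing an output statistic with its fault-free value can
be illustrated with the Python sketch below. The three-input circuit,
the stuck-at-0 fault, the tolerance and the test length are all
assumptions chosen for the example; they do not come from the cited
work.

```python
import random


def fault_free(a, b, c):
    """Example combinational circuit: f = (a AND b) OR c."""
    return (a & b) | c


def a_stuck_at_0(a, b, c):
    """Same circuit with input a stuck at logic 0."""
    return (0 & b) | c


def ones_frequency(circuit, n_inputs, n_tests, seed):
    """Apply random input vectors and return the fraction of tests for
    which the circuit output is a logic one."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(n_tests):
        bits = [rng.getrandbits(1) for _ in range(n_inputs)]
        ones += circuit(*bits)
    return ones / n_tests


# Fault-free reference: P(f = 1) = 1 - P(a AND b = 0) * P(c = 0) = 5/8.
REFERENCE = 5 / 8


def passes_compact_test(circuit, tolerance=0.05, n_tests=10000, seed=1):
    """Accept the circuit if its observed ones-frequency is within the
    tolerance of the fault-free reference value."""
    observed = ones_frequency(circuit, 3, n_tests, seed)
    return abs(observed - REFERENCE) <= tolerance


print(passes_compact_test(fault_free), passes_compact_test(a_stuck_at_0))
```

Only one number per output has to be stored and compared, which is what
makes the test compact; the price is that a fault which leaves the
statistic nearly unchanged can escape detection.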

A general method for predicting the parity of the
output of circuits that have one or more N bit wide inputs
and a single N bit wide output is discussed by
Khodadad-Mostashiry* [1979A+, 1979B+]. Error detection
properties of the method are discussed for several special
cases, e.g., bit-sliced circuits and iterative circuits.
When these circuits are fault-free, the output parity can
always be calculated in such a way as to preserve any single
input error. The error detection properties of the circuit
depend on the structure of the circuit. Some design
techniques to improve these detection properties are
presented. The output parity of any bit-sliced or iterative
circuit can be predicted and the resulting circuit can be
designed so that it is totally self-checking with respect to
any single stuck-at fault. This property can also be
demonstrated for circuits which have the following
restrictions:

1.  Only  inputs  have  fanout. 

2.  Each  fanout  has  only  two  outputs. 

The main advantage of this scheme over duplication is the
error-preserving property which allows a minimal number of
checkers to assure detection of permanent failures. The
method is especially suitable for byte-oriented circuits and
iterative circuits.
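
A familiar instance of parity prediction for an iterative circuit is
the ripple-carry adder, sketched below in Python. Each sum bit is the
exclusive-or of the two operand bits and the incoming carry, so the
parity of the N bit sum equals the parity of the two operands combined
with the parity of the internal carries. This textbook identity is
offered as an illustration only; it is not claimed to be the general
method of the cited work, and the function names are hypothetical.

```python
def parity(x):
    """Return 1 if x has an odd number of one bits, else 0."""
    return bin(x).count("1") & 1


def predicted_sum_parity(a, b, n_bits, carry_in=0):
    """Predict the parity of the n_bits-wide sum of a and b (both
    assumed to fit in n_bits) without forming the sum itself."""
    carries = 0
    carry = carry_in
    for i in range(n_bits):
        carries ^= carry  # carry entering bit position i
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return parity(a) ^ parity(b) ^ carries


# A checker compares the predicted parity with the parity of the
# actual adder output.
a, b = 0b1011, 0b0110
actual = (a + b) & 0xF
print(parity(actual) == predicted_sum_parity(a, b, 4))  # True
```

A single error at an input changes the predicted parity but not the
actual output of the (fault-free) adder computing on the erroneous
value in the same way for both sides of the comparison, so the check
must be placed with the circuit structure in mind, as the text notes.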




Davies*  [1979+]  identified  concepts  and  defined  terms 
useful  in  the  study  of  synchronization  and  data  matching  in 
the  presence  of  faults.  A  new  class  of  faults  having 
particular  effect  on  redundant  systems  was  described.  Three 
general  methods  for  performing  synchronization  and  data 
matching  were  given  and  evaluated.  Three  examples  were 
described  to  illustrate  the  general  method: 

1. a fault-tolerant clock,

2.  a  triple  modular  redundant  microcomputer  system,  and 

3.  a  quadruple  redundant  process  controller. 

This  work  provided  a  comprehensive  guide  to  design 
techniques  for  reliable  redundant  systems. 
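
Data matching in a triple modular redundant system is commonly done
with a bitwise majority voter, sketched below in Python. The sketch
covers only the comparison of three data words that are already
available at the same time; it does not address synchronization, and
the function names are hypothetical rather than taken from the cited
work.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three redundant data words."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_modules(a, b, c):
    """Return the indices (0, 1, 2) of the modules whose word differs
    from the voted result."""
    voted = majority_vote(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]


# Module 2 delivers a corrupted word; the voter masks it and the
# comparison identifies the disagreeing module.
print(bin(majority_vote(0b1010, 0b1010, 0b0110)))     # 0b1010
print(disagreeing_modules(0b1010, 0b1010, 0b0110))    # [2]
```

The voter masks any error confined to a single module, but it gives a
meaningful result only if the three words belong to the same
computation step, which is why synchronization and data matching are
treated together.
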
3. MEETINGS


In  addition  to  publishing  scientific  articles,  CRC 
personnel  participated  in  various  technical  conferences  and 
meetings  including: 

1. NAECON, in Dayton, Ohio, during May 1977, attended
by Prof. E.J. McCluskey and Dr. J.J. Losq,

2. FTCS-7, the Seventh Annual International Conference
on Fault-Tolerant Computing, in Los Angeles,
California, during June 1977, attended by Prof.
E.J. McCluskey, Dr. J.J. Losq, Dr. M.D.
Beaudry, Dr. M.L. Blount, and P.A. Thompson,

3. the 1978 IEEE International Solid-State Circuits
Conference, in San Francisco, California, during
February 1978, attended by Prof. E.J. McCluskey,

4. the Fifth Annual Symposium on Computer Architecture
in Palo Alto, California, during April 1978,
attended by Prof. E.J. McCluskey, Dr. M.D.
Beaudry, Dr. M.L. Blount, and P.A. Thompson,

5. FTCS-8, the Eighth Annual International Conference
on Fault-Tolerant Computing, in Toulouse, France,
during June 1978, attended by Dr. M.L. Blount,

6. NCC, the 1978 National Computer Conference, in
Anaheim, California, during June 1978, attended by
Prof. E.J. McCluskey and Dr. M.D. Beaudry,

7. the Third USA-Japan Computer Conference, in San
Francisco, California, during October 1978,
attended by Prof. E.J. McCluskey, Dr. M.D.
Beaudry, and P.A. Thompson,

8. the Fourth Symposium on Computer Arithmetic, in
Santa Monica, California, during October 1978,
attended by Prof. E.J. McCluskey and Dr. M.D.
Beaudry,

9. GOMAC, the Government Microcircuit Applications
Conference, in Monterey, California, during
November 1978, attended by Prof. E.J. McCluskey,

10. COMPCON 79, in San Francisco, California, during
February 1979, attended by Prof. E.J. McCluskey
and Dr. M.D. Beaudry,

11. the NASA workshop held by the Working Group on
Validation Methods for Fault-Tolerant Avionics
and Controls, in Hampton, Virginia, during March
1979, attended by Prof. E.J. McCluskey, and

12. the Second Annual Workshop on Design for
Testability, in Boulder, Colorado, during April
1979, attended by Prof. E.J. McCluskey.
DESCRIPTION OF THE
CENTER FOR RELIABLE COMPUTING


GENERAL INFORMATION

The  Center  for  Reliable  Computing,  CRC,  was  created  in  1976  to 
coordinate  the  research  efforts  in  the  domain  of  fault-tolerant 
computing.  The  center  is  part  of  the  Computer  Systems  Laboratory.  The 
Digital  Systems  Laboratory,  created  in  1969,  has  expanded  greatly  since 
that time and it has become appropriate to consolidate all the research
oriented  towards  reliable  computing  under  a  single  organization.  The 
trend  towards  increased  proliferation  of  computer  systems  in  many 
critical  applications  and  the  continual  demand  for  added  safety  and 
reliability  motivated  this  regroupment. 

The  goal  of  the  Center  is  to  provide  design  methods  and  evaluation 
techniques that will meet designers' needs for present-day and future
systems.  The  three  major  areas  where  most  of  the  attention  is  centered 
are:  the  problem  of  testing,  diagnosis  and  recovery;  the  evaluation  of 
the  various  techniques  used  to  enhance  reliability;  and  the  problem  of 
software  correctness. 


BACKGROUND 


This note presents a short summary of the reliability related
research performed at the Digital Systems Laboratory and continued at the
Center for Reliable Computing. Initially, the research was focused on
two  main  areas:  fault  properties  and  design  of  reliable  operating 
systems. 


Fault  Properties  and  Testing 

Faults inside digital circuits were extensively analysed for their
effects on circuit behavior, [McCluskey, 1971; Boute, 1971; Mei, 1974],
and also for their testability, [Mei, 1970; Boute, 1972]. This work led
to procedures that provide efficient tests for combinational, [Wang,
1975A; War.g Dias, 1975; Verdillon, 1975; Verdillon, 1976; Sziray,
1977A; Sziray, 1977B], and sequential circuits [Chesarek, 1972; Boute,
1974A; Boute, 1974B; Boute, 1975]. Special types of networks, such as
unate networks and iterative logic arrays, were the object of special
attention, and powerful testing strategies were developed for them
[i3er2r.c~u.-s, 2 =*73; Dias, 1976]. An algebraic model of combinational
logic networks was developed in order to simplify test generation
procedures [Clegg, 1973], and was later refined [Resse, 1973]. The work
on fault properties led to the investigation of the statistical
correctness of the outputs of circuits that have suffered internal
failures [Parker, 1975A; Parker, 1975B; Ogus, 1974A; McCluskey, 1978;
Parker, 3 9-1 5]. Subsequently, it became possible to quantitatively
evaluate the efficiency of both random testing techniques
[Shedletsky, 1975A; Shedletsky, 1975B; Shedletsky, 1976A; Shedletsky,
1976B; Shedletsky, 1977] and compact testing methods [Parker, 1976A;
Losq, 1976C; Losq, 1977B; Losq, 1978A] for both combinational and
sequential circuits. Studies in this area also pointed to the
practical advantages of a dynamic approach when choosing among random
inputs for the candidates with the highest potential for detection of
failures [Parker, 1973; Parker, 1976B]. More recent work has produced
many optimal strategies to test for intermittent failures [Savir, 1977A;
Savir, 1977B; Savir, 1977C; Savir, 1977D; Losq, 1978B].


Design  of  Reliable  Operating  Systems 


The original work on operating systems was centered around the problem
of parallelism. Design and verification methodologies were obtained,
[Bredt, 1971A; Bredt, 1971B; Bredt, 1971C; Bredt, 1973], and we were
able to evaluate the relative power of parallel system models such as
Petri Nets and P-V systems [Peterson, 1974]. This work also led to the
formal proof of direct correspondence between Petri Nets and finite state
machines and Turing machines [Peterson, 1976]. Subsequently, the
investigation was broadened to include hierarchical design methods
[Bredt, 1974; Saxena, 1975] and the associated verification problems
[Saxena, 1976]. New languages for efficient specification of interrupt
processing were developed along with formal verification methods and
studies of the fault-tolerance of real-time systems [Phillips, 1976].
The resynchronization problem for parallel systems following the
detection of errors was also investigated [Russel, 1975]. Recent work
has been focused on verifying the correctness of concurrent programs
[Owicki, 1977A; Owicki, 1977B].


Ultra-Reliable Computer Systems

A third area of study, on structures for fault-tolerant systems, was
initiated early in the program. Hybrid redundancy received much
attention. A very efficient design was obtained, [Siewiorek, 1973A],
and careful reliability evaluation was performed [Siewiorek, 1973B;
Ogus]. This did point out their extreme sensitivity to the
unreliability of the fault-detection and switching mechanisms [Losq,
1975]. Significant performance increase was obtained by incorporating
some fault-tolerance inside the fault-detection and switching mechanisms
[Ogus]. Alternatives to hybrid redundancy were also actively
studied. The effects of compensating failures in NMR systems were
analyzed [Siewiorek, 1971; Siewiorek, 1974]. An algorithm for the exact
reliability evaluation of TMR networks was obtained [Abraham, 1974].
Similarly, interwoven redundant systems were also precisely evaluated
[Abraham, 1975]. Self-purging redundancy was shown to offer significant
advantages over hybrid redundancy for many applications [Losq, 1974;
Losq, 1976A; Losq, 1976B]. A practical design of a small TMR processor
using LSI technology has been undertaken [Wakerly, 1975B; Wakerly,
1975C; Wakerly, 1976]. Duplex systems, because of their broad use, were
also the focus of much attention [Beaudry, 1977A]. Analytical models
were developed, [Fregni, 1974A], and special care was taken to
incorporate the effects of incomplete or faulty recoveries [Fregni,
1974B]. A powerful and general simulator was developed in parallel with
the analytical models [Thompson, 1976; Thompson, 1977A]. This
simulator, which allows different subsystems to be described at
different levels of detail, was used to determine very accurately the
reliability of a dual redundant avionic system [Thompson, 1977B;
Thompson, 1977C]. A different approach to evaluate highly redundant
systems is presently being investigated. It consists of a general and
automatic method to list all the combinations of failures that cause
system crash [Losq, 1977C; Losq, 1977D].


Computer  Systems  with  High  Availability 

At the same time, much work was devoted to the design and evaluation
of systems with more modest reliability requirements. Immediate error
circuitry [Usas, 1975B]. The problem of detection of errors in periodic
signals, like clocks, was solved [Usas, 1974; Usas, 1975A; Usas, 1976].
Currently, a design of an LSI-based version of a self-checking computer
is being completed [Wakerly, 1975D]. Much of the present effort in
analytical modeling is directed towards the evaluation of gracefully
degradable systems [McCluskey, 1977]. Analytical models which emphasize
the computation reliability rather than the traditional system
reliability have provided a method to quantify the effects of failures
[Beaudry, 1977B; Beaudry, 1977C; Beaudry, 1977D; Losq, 1977A].
Attempts have been made to obtain very accurate models for the
efficiency of the various recovery strategies [Blount, 1977A; Blount,
1977B]. Also, there is continued work on peripheral device reliability
especially through the use of self-checking design [Lu, 1977].


Summary

More detailed surveys of the research at the Center for
Reliable Computing have been prepared [McCluskey, 1975A; McCluskey,
1975B; McCluskey, 1976].
Apologies are offered to all researchers, both at Stanford and
elsewhere,  whose  work  was  not  included.  Space  does  not  allow  mention  of 
all  the  work  done  at  Stanford  and  no  attempt  has  been  made  to  credit  the 
excellent  research  being  carried  out  elsewhere  because  of  this  same 
limitation. 


FACILITIES 

The  Center  for  Reliable  Computing  is  located  in  the  Electronics 
Research  Laboratory  (ERL),  on  the  Stanford  campus.  The  Center  takes 
advantage  of  the  use  of  several  laboratories,  computing  centers  and 
libraries. 


Laboratories 


-  Digital  Design  Laboratory,  in  ERL 

.  intended  for  small  circuit  design 

-  Microprocessor  Laboratory,  in  ERL 

.  intended  for  the  development,  at  prototype  level,  of  microprocessor 
based  systems 

.  equipped  with  two  Intellec,  floppy  disc  based,  development  systems 

-  Testing  Laboratory,  in  ERL 

.  intended  for  test  generation  for  logic  boards 
.  equipped  with  H.P.  TEST-AID  system 


Computing  Centers 

-  PDP  11/20  system,  in  ERL 

.  intended  for  local  computing  by  the  personnel  of  the  Digital  Systems 
Laboratory 

-  IBM 370/153 system, Stanford Center for Information Processing (SCIP)

.  large time-shared computing center


-  PDP 10 based system, Artificial Intelligence Project

.  large time-shared computing center running mostly LISP and SAIL-type
languages

-  PDP 10 based system, Low Overhead Time-Sharing System (LOTS)

.  large time-shared computing center, intended for terminal-oriented

-  Access to the ARPA network

.  through one IMP in ERL and one node at the Artificial Intelligence
Project building
Libraries


-  Stanford  Engineering  Library,  Main  Library  building 

-  Computer  Science  Library,  Polya  building 

-  U.C.  Berkeley  Libraries  through  an  exchange  agreement  between  Stanford 
Libraries  and  the  Libraries  of  the  University  of  California,  Berkeley 


CURRENT  PERSONNEL 


Director              : E.J. McCluskey, Professor, Electrical Engineering
                        and Computer Science

Faculty               : W. vanCleemput, Assistant Professor, Electrical
                        Engineering and Computer Science
                      : S.S. Owicki, Assistant Professor, Electrical
                        Engineering and Computer Science
                      : D. Luckham, Adjunct Professor, Electrical
                        Engineering

Senior Research Staff : M.D. Beaudry, Research Associate, Electrical
                        Engineering

Research Assistants   : S. Bozorgui-Nesbat, Electrical Engineering
                      : W. Cory, Electrical Engineering
                      : D. Davies, Electrical Engineering
                      : P.L. Fu, Electrical Engineering
                      : D. Gifford, Electrical Engineering
                      : D.D. Hill, Electrical Engineering
                      : B. Khodadad, Electrical Engineering
                      : D.J. Lu, Electrical Engineering
                      : T. Minoura, Electrical Engineering